- Cheap HTML parser
- Jim Davis
- davis@dri.cornell.edu
- July 1994
-
- This is code for doing simple processing on HTML. I know there are bugs and
- limitations in the code, but it suffices for simple purposes. Among the
- limitations: This is an HTML parser, not an SGML parser. It does not accept
- a DTD; rather, the model of HTML is built into the code. It also does not
- validate the HTML: it will attempt to parse invalid documents, and the
- results are undefined if the document is in error.
-
- The source code is available as a compressed Unix tar file. It runs under
- perl 4.0 patch level 36. I don't know about other versions of perl. This
- directory contains:
-
- parse-html.pl
- A simple HTML parser written in perl. As it parses the HTML, it calls a
- routine for each tag encountered, and for whitespace and content. You
- can redefine these routines to process the HTML document as you need.
- html-to-ascii.pl
- Uses the HTML parser to generate a plain ASCII version of an HTML
- document.
- html-ascii.pl
- The actual routines to generate the ASCII.
- tformat.pl
- A low-level text formatter used for generating ASCII. More or less like
- a subset of nroff.
- html-to-rfc.pl
- Uses the HTML parser to generate a plain ASCII version of an HTML
- document, formatted as required for Internet drafts and RFCs.
- rfc.pl
- Additional routines required for RFC formatting (e.g. page headers and
- footers).
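
The callback design described for parse-html.pl (the parser calls one routine per tag, plus routines for whitespace and content, and you redefine those routines) can be illustrated with a short sketch. This is not the Perl code from the distribution and does not use its routine names. It is an analogous example built on Python's standard html.parser module, and the class name TagLogger and its events list are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records each parse event; override the handle_* methods to do real work."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag; tag names arrive lowercased.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags; label whitespace-only runs
        # so a formatter could treat them differently from content.
        kind = "whitespace" if data.isspace() else "content"
        self.events.append((kind, data))


parser = TagLogger()
parser.feed("<H1>Title</H1> <P>Body text")
parser.close()
for event in parser.events:
    print(event)
```

A converter such as an HTML-to-ASCII tool would replace the bodies of these handlers with formatting actions instead of appending to a list.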
-
- Generating RFCs from HTML
-
- The RFC format requires a header and footer containing, among other
- things, the names of the authors and a short title. You specify values
- for these fields with META tags, as in the following example:
-
- <META name="status" content="Internet Draft">
- <META name="title" content="Internet audio protocol">
- <META name="date" content="July 1983">
- <META name="author" content="Nixon, Haldeman">
-
- (The META tag is not officially part of HTML; it was proposed by Roy
- Fielding.) The tags should be in the HEAD.
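
As a rough illustration of how the META fields above could be gathered before building page headers and footers, here is a minimal sketch. It is not how html-to-rfc.pl or rfc.pl does it. It assumes Python's standard html.parser module, and MetaCollector and fields are hypothetical names. For brevity it accepts META tags anywhere in the input rather than checking that they sit inside the HEAD.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <META name=... content=...> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Ignore META tags that lack either attribute.
        if name is not None and content is not None:
            self.fields[name.lower()] = content


SAMPLE = """<HEAD>
<META name="status" content="Internet Draft">
<META name="title" content="Internet audio protocol">
<META name="date" content="July 1983">
<META name="author" content="Nixon, Haldeman">
</HEAD>"""

collector = MetaCollector()
collector.feed(SAMPLE)
collector.close()
fields = collector.fields
for key in ("status", "title", "date", "author"):
    print(key, "=", fields[key])
```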
-
- Known bugs
-
- * It can't parse the prolog (or whatever you call it) because it does
- not know how to match the square brackets, as in the following:
-
- <!DOCTYPE HTML [
- <!entity % HTML.Minimal "INCLUDE">
- <!-- Include standard HTML DTD -->
- <!ENTITY % html PUBLIC "-//connolly hal.com//DTD WWW HTML 1.8//EN">
- %html;
- ]>
-
- * Font tags (e.g. CODE, EM) cause extra whitespace in the output, e.g.
- "<TT>foo</TT>," yields "foo ,".
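
For the first bug, one possible way to get past a prolog with a bracketed internal subset is to track square-bracket depth and only accept a closing ">" at depth zero. The following is a sketch of that idea, not a fix taken from the distribution. The function name skip_doctype is hypothetical, and the code deliberately ignores the possibility of brackets inside quoted strings or comments.

```python
def skip_doctype(text, start=0):
    """Return the index just past a <!DOCTYPE ...> declaration at `start`.

    Square-bracket depth is tracked so that '>' characters inside the
    internal subset ([ ... ]) do not end the declaration early.
    Simplification: brackets inside quotes or comments are not special-cased.
    """
    if text[start:start + 9].upper() != "<!DOCTYPE":
        raise ValueError("no DOCTYPE declaration at this position")
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == ">" and depth == 0:
            # First '>' outside any brackets closes the declaration.
            return i + 1
    raise ValueError("unterminated DOCTYPE declaration")


PROLOG = """<!DOCTYPE HTML [
<!entity % HTML.Minimal "INCLUDE">
<!-- Include standard HTML DTD -->
<!ENTITY % html PUBLIC "-//connolly hal.com//DTD WWW HTML 1.8//EN">
%html;
]>
<TITLE>Example</TITLE>"""

end = skip_doctype(PROLOG)
rest = PROLOG[end:].lstrip()
print(rest)
```

A parser could call such a routine once at the start of the input and then hand the remainder to its normal tag loop.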
-